```r
# install.packages("targets")
library(targets)
use_targets()
```
1 Introduction
When working on large projects that involve tasks such as training machine learning models or re-rendering Quarto documents after minor changes like correcting a typo, you can end up wasting a lot of time and resources.
Hence the need for a package like {targets} that acts like a smart assistant that knows exactly which parts of your code need to run, when they need to run, and how they depend on each other. Just like a good recipe breaks down cooking into manageable steps, {targets} helps you organize your R workflow into clear, connected pieces that automatically update when needed.
What makes {targets} special is its intelligent dependency tracking; it understands the relationships between different parts of your workflow. Changed your raw data? {targets} knows to update your models. Modified your cleaned data code? It’ll automatically refresh your visualizations. No more manually running scripts in the right order or wondering if your results are up to date.
2 Initial Setup
You first need to install and load the package. If you haven't installed it yet, you can do so by uncommenting the first line of the code above, then load the package and initialize a new project with `use_targets()`. That last call creates and opens a `_targets.R` file, which will be used to define our pipeline, or 'steps' in simpler terms.
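For reference, the generated `_targets.R` looks roughly like the sketch below. The exact template varies across {targets} versions, so treat this as an illustration rather than the literal file you will see:

```r
# _targets.R — skeleton similar to what use_targets() generates
library(targets)

# Packages that your targets need to run:
tar_option_set(
  packages = c("tibble")
)

# Source all scripts containing your custom functions:
tar_source()

# The pipeline: a list of targets
list(
  tar_target(data, tibble::tibble(x = rnorm(100), y = rnorm(100))),
  tar_target(model, coefficients(lm(y ~ x, data = data)))
)
```

We will replace this placeholder content with our own options, functions, and targets in the next section.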
3 Pipeline Setup
As an example, we will explore a demo I built based on a previous report I posted.
3.1 Loading the needed packages
To make packages available to your pipeline, use the `tar_option_set()` function: it lets you specify the packages that should be loaded when your targets run.
_targets.R

```r
# Load packages required to define the pipeline:
library(targets)
library(tarchetypes) # Load other packages as needed.

# Set target options:
tar_option_set(
  packages = c("tidyverse", "e1071", "plotly",
               "rpart", "glmnet", "tinytable",
               "kableExtra", "modelsummary", "caret",
               "rsample") # Packages that your targets need for their tasks.
)
```
3.2 Defining functions
To create targets, we need a file containing functions, one producing each object we need, from cleaned data to visualizations to fitted models.
In this example, we define a function that takes a file path as input, reads the file, and returns a clean version of our dataset:
scripts/functions.R

```r
get_data <- function(file){
  read_csv(file) %>%
    mutate(rental_days = as.numeric((return_date - rental_date) / 24)) %>%
    select(rental_days, everything(), -c(rental_date, return_date)) %>%
    mutate(trailers = ifelse(grepl("Trailers", special_features), 1, 0),
           behind_the_scenes = ifelse(grepl("Behind the Scenes", special_features), 1, 0),
           commentaries = ifelse(grepl("Commentaries", special_features), 1, 0),
           deleted_scenes = ifelse(grepl("Deleted Scenes", special_features), 1, 0))
}
```
Of course, you will be defining more functions than this one, ranging from creating plots to training models.
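As a sketch of what one of those might look like, here is a possible `lm_model()` function of the kind the pipeline below calls. The predictor set here is an assumption for illustration; the demo's actual model specification may differ:

```r
# Hypothetical sketch of a model-fitting function for the pipeline;
# the chosen predictors are an assumption, not the demo's actual formula.
lm_model <- function(data){
  lm(rental_days ~ amount + rental_rate + length + replacement_cost,
     data = data)
}
```

Each such function takes upstream objects (here, the cleaned data) as arguments and returns the object the corresponding target should hold.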
3.3 Defining targets
To benefit from the package, we should define what it calls 'targets': objects that get monitored for changes and updated when needed (somewhat like a listener, for those familiar with networking).
_targets.R

```r
tar_source(files = "scripts")
# tar_source("other_functions.R") # Source other scripts as needed.

# Replace the target list below with your own:
list(
  tar_target(file, "rental_info.csv", format = "file"),
  tar_target(data, get_data(file)),
  tar_target(movies_features, movies_features_plot(data)),
  tar_target(movies_ratings, movies_ratings_plot(data)),
  tar_target(movies_release_years, movies_release_years_plot(data)),
  tar_target(summary, data_summary(data)),
  tar_target(corr, corr_table(data)),
  tar_target(model, lm_model(data))
)
```
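Before building anything, you can ask {targets} which targets are currently out of date; on a fresh project, every target is:

```r
# Returns the names of targets that tar_make() would (re)build
tar_outdated()
```

This is a cheap way to sanity-check the pipeline definition before committing to a full run.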
3.4 Displaying your project pipeline
To build your pipeline, run the following:
```r
tar_make()
```

```
▶ dispatched target file
● completed target file [3.489 seconds, 1.941 megabytes]
▶ dispatched target data
Rows: 15861 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): special_features
dbl  (9): amount, release_year, rental_rate, length, replacement_cost, NC-17...
dttm (2): rental_date, return_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
● completed target data [0.484 seconds, 82.18 kilobytes]
▶ dispatched target summary
● completed target summary [0.115 seconds, 338.928 kilobytes]
▶ dispatched target movies_features
● completed target movies_features [0.027 seconds, 84.632 kilobytes]
▶ dispatched target corr
● completed target corr [0.76 seconds, 4.387 kilobytes]
▶ dispatched target movies_release_years
● completed target movies_release_years [0.019 seconds, 187.214 kilobytes]
▶ dispatched target movies_ratings
● completed target movies_ratings [0.016 seconds, 84.65 kilobytes]
▶ dispatched target model
● completed target model [0.564 seconds, 3.866 megabytes]
▶ ended pipeline [6.786 seconds]
```
To visualize your pipeline, use `tar_visnetwork()`.
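The graph can get busy on larger projects; a couple of built-in options keep it readable:

```r
# Show only the targets, hiding the functions they depend on:
tar_visnetwork(targets_only = TRUE)

# A faster, lighter dependency graph that skips the up-to-date check:
tar_glimpse()
```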
3.5 Reading/Loading objects
After building your pipeline, you can call each object you created using `tar_read()` to inspect its content, or `tar_load()` to load it directly into your environment. You'll find the names of all your targets in the output of `tar_manifest()`:
```
# A tibble: 8 × 2
  name                 command
  <chr>                <chr>
1 file                 "\"rental_info.csv\""
2 data                 "get_data(file)"
3 summary              "data_summary(data)"
4 movies_features      "movies_features_plot(data)"
5 corr                 "corr_table(data)"
6 movies_release_years "movies_release_years_plot(data)"
7 movies_ratings       "movies_ratings_plot(data)"
8 model                "lm_model(data)"
```
Reading a target such as `data`, for example, returns the cleaned dataset:

```
# A tibble: 6 × 15
  rental_days amount release_year rental_rate length replacement_cost
        <dbl>  <dbl>        <dbl>       <dbl>  <dbl>            <dbl>
1        3.87   2.99         2005        2.99    126             17.0
2        2.84   2.99         2005        2.99    126             17.0
3        7.24   2.99         2005        2.99    126             17.0
4        2.1    2.99         2005        2.99    126             17.0
5        4.05   2.99         2005        2.99    126             17.0
6        3.20   2.99         2005        2.99    126             17.0
# ℹ 9 more variables: special_features <chr>, `NC-17` <dbl>, PG <dbl>,
#   `PG-13` <dbl>, R <dbl>, trailers <dbl>, behind_the_scenes <dbl>,
#   commentaries <dbl>, deleted_scenes <dbl>
```
```r
movies_plot <- tar_read(movies_features)
movies_plot
```
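`tar_load()` works the same way but assigns objects into your environment under their target names, and it accepts tidyselect helpers, so you can load several targets at once:

```r
tar_load(model)            # creates `model` in the global environment
tar_load(c(summary, corr)) # load a few targets by name
tar_load(everything())     # or load all targets at once
```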
3.6 The useful part: Updating the targets
When you modify one of the targets we created, {targets} detects the change and re-runs only the necessary functions.
```r
functions_script <- readLines('scripts/functions.R')

modified_code <-
  "movies_features_plot <- function(data){
    data %>%
      # The change we added to our function
      filter(release_year == 2005) %>%
      select(trailers, behind_the_scenes, commentaries, deleted_scenes) %>%
      summarise(trailers = sum(trailers), behind_the_scenes = sum(behind_the_scenes),
                commentaries = sum(commentaries), deleted_scenes = sum(deleted_scenes)) %>%
      pivot_longer(cols = everything(), names_to = 'Special Feature', values_to = 'Count') %>%
      plot_ly(labels = ~`Special Feature`, values = ~Count, type = 'pie') %>%
      layout(title = list(text = 'Movies Features during 2005', x = 0),
             paper_bgcolor = '#FFF1E5',
             plot_bgcolor = '#FFF1E5',
             xaxis = list(gridcolor = 'gray1'),
             yaxis = list(gridcolor = 'gray1'))
  }"

# Append the new definition; when sourced, the later definition wins
functions_script <- c(functions_script, modified_code)
writeLines(functions_script, 'scripts/functions.R')
```
The visnetwork graph now shows that the plot function is outdated, which makes the plot target outdated as well; to update it, we only need to run `tar_make()` again.
```r
tar_make()
```

```
✔ skipped target file
✔ skipped target data
✔ skipped target summary
▶ dispatched target movies_features
● completed target movies_features [0.038 seconds, 84.69 kilobytes]
✔ skipped target corr
✔ skipped target movies_release_years
✔ skipped target movies_ratings
✔ skipped target model
▶ ended pipeline [3.187 seconds]
```
Now it displays the modified plot without re-running the rest of the pipeline.

```r
tar_read(movies_features)
```
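Conversely, if you ever want to force a rebuild without editing any code (for instance, when a target reads from a source {targets} cannot track), you can invalidate it by hand. This utility is not part of the demo above, just part of the package's toolkit:

```r
tar_invalidate(movies_features) # mark this target as outdated
tar_make()                      # rebuilds movies_features; everything else is skipped
```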
4 Conclusion
The {targets} package is an invaluable tool for data scientists working on large, complex projects. By tracking dependencies, caching results, and keeping your workflow efficient, {targets} can significantly boost productivity and prevent wasted time.